Data Analysis

Comprehensive Data Cleaning & Exploratory Analysis of Job Market Trends

Authors
Affiliation

Haoran Guo

Boston University

Junze Tan

Boston University

import pandas as pd
df = pd.read_csv('lightcast_job_postings.csv')
#4.1
columns = df.columns.tolist()
for i in range(0, len(columns), 5):
    print(columns[i:i+5])
['ID', 'LAST_UPDATED_DATE', 'LAST_UPDATED_TIMESTAMP', 'DUPLICATES', 'POSTED']
['EXPIRED', 'DURATION', 'SOURCE_TYPES', 'SOURCES', 'URL']
['ACTIVE_URLS', 'ACTIVE_SOURCES_INFO', 'TITLE_RAW', 'BODY', 'MODELED_EXPIRED']
['MODELED_DURATION', 'COMPANY', 'COMPANY_NAME', 'COMPANY_RAW', 'COMPANY_IS_STAFFING']
['EDUCATION_LEVELS', 'EDUCATION_LEVELS_NAME', 'MIN_EDULEVELS', 'MIN_EDULEVELS_NAME', 'MAX_EDULEVELS']
['MAX_EDULEVELS_NAME', 'EMPLOYMENT_TYPE', 'EMPLOYMENT_TYPE_NAME', 'MIN_YEARS_EXPERIENCE', 'MAX_YEARS_EXPERIENCE']
['IS_INTERNSHIP', 'SALARY', 'REMOTE_TYPE', 'REMOTE_TYPE_NAME', 'ORIGINAL_PAY_PERIOD']
['SALARY_TO', 'SALARY_FROM', 'LOCATION', 'CITY', 'CITY_NAME']
['COUNTY', 'COUNTY_NAME', 'MSA', 'MSA_NAME', 'STATE']
['STATE_NAME', 'COUNTY_OUTGOING', 'COUNTY_NAME_OUTGOING', 'COUNTY_INCOMING', 'COUNTY_NAME_INCOMING']
['MSA_OUTGOING', 'MSA_NAME_OUTGOING', 'MSA_INCOMING', 'MSA_NAME_INCOMING', 'NAICS2']
['NAICS2_NAME', 'NAICS3', 'NAICS3_NAME', 'NAICS4', 'NAICS4_NAME']
['NAICS5', 'NAICS5_NAME', 'NAICS6', 'NAICS6_NAME', 'TITLE']
['TITLE_NAME', 'TITLE_CLEAN', 'SKILLS', 'SKILLS_NAME', 'SPECIALIZED_SKILLS']
['SPECIALIZED_SKILLS_NAME', 'CERTIFICATIONS', 'CERTIFICATIONS_NAME', 'COMMON_SKILLS', 'COMMON_SKILLS_NAME']
['SOFTWARE_SKILLS', 'SOFTWARE_SKILLS_NAME', 'ONET', 'ONET_NAME', 'ONET_2019']
['ONET_2019_NAME', 'CIP6', 'CIP6_NAME', 'CIP4', 'CIP4_NAME']
['CIP2', 'CIP2_NAME', 'SOC_2021_2', 'SOC_2021_2_NAME', 'SOC_2021_3']
['SOC_2021_3_NAME', 'SOC_2021_4', 'SOC_2021_4_NAME', 'SOC_2021_5', 'SOC_2021_5_NAME']
['LOT_CAREER_AREA', 'LOT_CAREER_AREA_NAME', 'LOT_OCCUPATION', 'LOT_OCCUPATION_NAME', 'LOT_SPECIALIZED_OCCUPATION']
['LOT_SPECIALIZED_OCCUPATION_NAME', 'LOT_OCCUPATION_GROUP', 'LOT_OCCUPATION_GROUP_NAME', 'LOT_V6_SPECIALIZED_OCCUPATION', 'LOT_V6_SPECIALIZED_OCCUPATION_NAME']
['LOT_V6_OCCUPATION', 'LOT_V6_OCCUPATION_NAME', 'LOT_V6_OCCUPATION_GROUP', 'LOT_V6_OCCUPATION_GROUP_NAME', 'LOT_V6_CAREER_AREA']
['LOT_V6_CAREER_AREA_NAME', 'SOC_2', 'SOC_2_NAME', 'SOC_3', 'SOC_3_NAME']
['SOC_4', 'SOC_4_NAME', 'SOC_5', 'SOC_5_NAME', 'LIGHTCAST_SECTORS']
['LIGHTCAST_SECTORS_NAME', 'NAICS_2022_2', 'NAICS_2022_2_NAME', 'NAICS_2022_3', 'NAICS_2022_3_NAME']
['NAICS_2022_4', 'NAICS_2022_4_NAME', 'NAICS_2022_5', 'NAICS_2022_5_NAME', 'NAICS_2022_6']
['NAICS_2022_6_NAME']
#4.2
columns_to_drop = [
    "ID", "URL", "ACTIVE_URLS", "DUPLICATES", "LAST_UPDATED_TIMESTAMP",
    "NAICS2", "NAICS3", "NAICS4", "NAICS5", "NAICS6",
    "SOC_2", "SOC_3", "SOC_5"
]
df.drop(columns=columns_to_drop, inplace=True)

4.2: The above columns are to be dropped because they are not relevant in our analysis. For example, ID is not an insightful variable here because it entails little numerical value for our analysis (i.e. different IDs do not represent different meaningful characteristics for the posted job). Other variables such as NAICS4 or NAICS5 will be deleted because they are out-dated.

pd.DataFrame(df.columns, columns=["Remaining Columns"])
Remaining Columns
0 LAST_UPDATED_DATE
1 POSTED
2 EXPIRED
3 DURATION
4 SOURCE_TYPES
... ...
113 NAICS_2022_4_NAME
114 NAICS_2022_5
115 NAICS_2022_5_NAME
116 NAICS_2022_6
117 NAICS_2022_6_NAME

118 rows × 1 columns

Dropping irrelavant columns will make our database more compact and will save us cognitive efforts when we are interpreting graphs, as now we don’t need to interpret the bars or plots for those irrelavant columns. We can therefore focus on more meaningful columns such as the SALARY or industry type.

#4.4
import missingno as msno
import matplotlib.pyplot as plt

msno.heatmap(df)
plt.title("Missing Values Heatmap")
plt.show()

# Drop columns with more than 50% missing values
keep_columns = ["INDUSTRY", "SALARY"]
missing_percent = df.isnull().mean()
cols_to_drop = missing_percent[
    (missing_percent > 0.5) & (~missing_percent.index.isin(keep_columns))
].index
df.drop(columns=cols_to_drop, inplace=True)
# Fill missing values
for col in df.columns:
    if df[col].dtype in ['int64', 'float64']:
        df[col] = df[col].fillna(df[col].median())
    elif df[col].dtype == 'object':
        df[col] = df[col].fillna("Unknown")

Here we are handling the missing values in the INDUSTRY, SALARY, NAICS2_NAME, and other columns. We drop all columns with >50% missing values, and for the salary and the NAICS2_NAME (containing industry names) columns, we fill NAs with the median value and ‘Unknown’, respectively.

df.head(3)
LAST_UPDATED_DATE POSTED EXPIRED DURATION SOURCE_TYPES SOURCES TITLE_RAW BODY MODELED_EXPIRED MODELED_DURATION ... NAICS_2022_2 NAICS_2022_2_NAME NAICS_2022_3 NAICS_2022_3_NAME NAICS_2022_4 NAICS_2022_4_NAME NAICS_2022_5 NAICS_2022_5_NAME NAICS_2022_6 NAICS_2022_6_NAME
0 2024-09-06 2024-06-02 2024-06-08 6.0 [\n "Company"\n] [\n "brassring.com"\n] Enterprise Analyst (II-III) 31-May-2024\n\nEnterprise Analyst (II-III)\n\n... 2024-06-08 6.0 ... 44 Retail Trade 441 Motor Vehicle and Parts Dealers 4413 Automotive Parts, Accessories, and Tire Retailers 44133 Automotive Parts and Accessories Retailers 441330 Automotive Parts and Accessories Retailers
1 2024-08-02 2024-06-02 2024-08-01 18.0 [\n "Job Board"\n] [\n "maine.gov"\n] Oracle Consultant - Reports (3592) Oracle Consultant - Reports (3592)\n\nat SMX i... 2024-08-01 16.0 ... 56 Administrative and Support and Waste Managemen... 561 Administrative and Support Services 5613 Employment Services 56132 Temporary Help Services 561320 Temporary Help Services
2 2024-09-06 2024-06-02 2024-07-07 35.0 [\n "Job Board"\n] [\n "dejobs.org"\n] Data Analyst Taking care of people is at the heart of every... 2024-06-10 8.0 ... 52 Finance and Insurance 524 Insurance Carriers and Related Activities 5242 Agencies, Brokerages, and Other Insurance Rela... 52429 Other Insurance Related Activities 524291 Claims Adjusting

3 rows × 109 columns

#4.5
df = df.drop_duplicates(subset=["TITLE", "COMPANY", "LOCATION", "POSTED"], keep="first")
import plotly.io as pio
pio.renderers.default = 'notebook'
#5.1.1
import plotly.express as px
industry_counts = df["NAICS2_NAME"].value_counts()

fig = px.bar(industry_counts, 
             title="Job Postings by Industry", 
             labels={'index': 'Industry', 'value': 'Number of Job Postings'},
             text_auto=True)
fig.show()

5.1.1: The graph of job posting by industry is chosen because we need to know which industries are more ‘eager’ to hiring. This can guide job seekers about which industries to choose when deciding where to send their resumes. The graph shows that the Professional, Scientific, and Technical Services industry dominates in terms of job posted (with more than 22k jobs posted). This implies that more opportunities exist in that industry and job seekers should think about advancing their skills corresponding to the skill requirements in this industry.

#5.1.2
fig = px.box(df, 
             x="NAICS2_NAME", 
             y="SALARY", 
             title="Salary Distribution by Industry",
             labels={"Industry": "Industry", "Salary": "Salary"},
             points="all") 
fig.update_layout(xaxis_tickangle=-45)  
fig.show()
df.groupby("NAICS2_NAME")["SALARY"].mean().sort_values(ascending=False)
NAICS2_NAME
Information                                                                 129838.706942
Professional, Scientific, and Technical Services                            122545.236573
Accommodation and Food Services                                             121704.673684
Retail Trade                                                                119727.348942
Manufacturing                                                               118032.224153
Finance and Insurance                                                       117732.917823
Utilities                                                                   116621.473846
Construction                                                                115786.905537
Management of Companies and Enterprises                                     114972.733209
Wholesale Trade                                                             113760.985653
Mining, Quarrying, and Oil and Gas Extraction                               113593.815152
Unclassified Industry                                                       113440.233087
Transportation and Warehousing                                              111254.506798
Administrative and Support and Waste Management and Remediation Services    110401.264970
Agriculture, Forestry, Fishing and Hunting                                  109806.578947
Health Care and Social Assistance                                           109014.679035
Other Services (except Public Administration)                               105049.230715
Real Estate and Rental and Leasing                                          103524.830334
Arts, Entertainment, and Recreation                                         102223.296667
Educational Services                                                         99009.963886
Public Administration                                                        92306.371459
Name: SALARY, dtype: float64

5.1.2: In order to gain a deeper understanding about salaries of jobs in different industries, we need a box plot to show us the salary distributions for various industries. Results indicate that the mean salaries among all the industries are almost at the same level (around 100K). However, jobs in certain industries have the potential to reach substantially more salaries than the average. These industries include Administrative and Support and Waste Management and Remediation Services, Information, and Health Care and Social Assistance. Job seekers may want to look into these industries if they aspire to get higher salaries when they advance in their positions.

fig = px.pie(df, names="REMOTE_TYPE_NAME", title="Remote vs. On-Site Jobs")
fig.show()

5.1.3: The remost vs. on-site job chart enables us to know whether the jobs posted allow remote working or require on-site working. The data here does not allow us to see a bigger picture due to the 78.3% None value in the column, but from the remaining data we do see that 17% of jobs allow remote working. This can imply that remote work has gained its momentum, and that job seekers nowadays are faced with both opportunities and challenges brought by remote work. For the opportunities, workers now have more flexibility in working location and are no longer binded by physical offices, meaning that they will save time of daily commute and can have better work-life balance. The presence of AI and computer technology also facilitates remote working by enabling workers to work anywhere with internet connections. However, the downside of remote working is that social interactions among workers will decrease. Team spirits and cooperation will diminish when workers are not present in the same physical space. Future job seekers, as well as hiring companies, need to evaluate the ups and downs of this remote working trend in order to optimize work motivation.